[57] ABSTRACT

A text classification system and method that can be used by an application for classifying natural language text input into a computer system having a domain specific knowledge base that includes a plurality of categories. The text classification system classifies natural language input text by first parsing the input text into a first list of recognized keywords. This list is then used to deduce further facts from the natural language input text, which are then compiled into a second list. Next, a numeric similarity score is calculated for each one of the plurality of categories in the knowledge base, which indicates how similar that category is to the natural language input text. A dynamic threshold is then applied to determine which ones of the plurality of categories are most similar to the recognized keywords of the natural language input text. A third list is compiled of the ones of the plurality of categories determined to be most similar to the recognized keywords. An optional rule base can be utilized to further refine the determination of which ones of the plurality of categories are most similar to the recognized keywords of the natural language input text. Also, an optional learning capability can be added to improve the accuracy of the text classification system.
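
By way of illustration only, the sequence of steps summarized above (first list of recognized keywords, second list of deduced facts, similarity scores, dynamic threshold, third list of categories) can be sketched in a few lines of Python. The lexicon, the facts, the profile weights and every function name below are hypothetical and are not taken from the disclosure. The sketch also assumes simple whitespace tokenization with no morphological analysis, and it assumes that a category whose score equals the threshold is kept.

```python
from collections import Counter

# Hypothetical toy knowledge base; none of these entries come from the patent.
LEXICON = {"crash", "disk", "printer", "rd54"}
KEYWORD_FACTS = {"rd54": [("DEVICE TYPE", "DISK")]}  # keyword -> attached facts
PROFILES = {                                         # category -> {keyword: weight}
    "disk_failure": {"disk": 1.0, "rd54": 1.5, "crash": 0.5},
    "printing": {"printer": 2.0},
    "system_crash": {"crash": 1.5},
}


def recognize_keywords(text):
    """First list: recognized keywords with their frequencies in the text."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    return Counter(t for t in tokens if t in LEXICON)


def deduce_facts(keywords):
    """Second list: facts attached to the recognized keywords, without repeats."""
    facts = []
    for kw in keywords:
        for fact in KEYWORD_FACTS.get(kw, []):
            if fact not in facts:
                facts.append(fact)
    return facts


def similarity_scores(keywords):
    """One numeric score per category: weighted sum over recognized keywords."""
    return {
        cat: sum(weight * keywords.get(kw, 0) for kw, weight in profile.items())
        for cat, profile in PROFILES.items()
    }


def most_similar(scores, offset):
    """Third list: categories scoring within `offset` of the best score.

    Assumption: a score exactly equal to the threshold is kept.
    """
    threshold = max(scores.values()) - offset
    kept = [cat for cat, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda cat: -scores[cat])


def classify(text, offset=1.0):
    """Return the three lists handed back to the calling application."""
    keywords = recognize_keywords(text)
    facts = deduce_facts(keywords)
    categories = most_similar(similarity_scores(keywords), offset)
    return keywords, facts, categories
```

In this sketch a call such as `classify("My RD54 disk had a crash.")` returns all three lists together, so that a calling application can apply its own processing to them. A larger `offset` admits more categories into the third list.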

24 Claims, 6 Drawing Sheets 
METHOD AND APPARATUS FOR TEXT CLASSIFICATION

FIELD OF THE INVENTION

The present invention is directed to text classification and, more particularly, to a computer based system for text classification that provides a resource that can be utilized by external applications for text classification.

BACKGROUND OF THE INVENTION

The growing volume of publicly available, machine-readable textual information makes it increasingly necessary for businesses to automate the handling of such information to stay competitive. By automating the handling of text, businesses can decrease costs and increase quality in performing tasks that require access to textual information.

A commercially important class of text processing applications is text classification systems. Automated text classification systems identify the subject matter of a piece of text as belonging to one or more categories from a potentially large predefined set of categories. Text classification includes a class of applications that can solve a variety of problems in the indexing and routing of text.

Routing of text is useful in large organizations where there is a large volume of individual pieces of text that needs to be sent to specific persons (e.g., technical support specialists inside a large customer support center). Indexing text is useful in attaching topic labels to information and partitioning the information space to aid information retrieval. Indexing can facilitate the retrieval of information based upon the contents of text rather than boolean keyword searches from databases that include information such as news articles, federal regulations, etc.

A number of different approaches have been developed for automatic text processing. One approach is based upon information retrieval techniques utilizing boolean keyword searches. This approach, however, has problems with accuracy. A second approach borrows natural language processing from artificial intelligence technology to achieve higher accuracy. While natural language processing improves accuracy based upon an analysis of the meaning of input text, speed of execution and range of coverage become problematic when such techniques are applied to large volumes of text.

Others have recognized the foregoing shortcomings and have attempted to reach a middle ground between information retrieval techniques and natural language/knowledge-based techniques to achieve acceptable accuracy without sacrificing speed of execution or range of coverage. This has been accomplished through predominantly rule based systems which parse the input text using natural language morphology techniques, attempt to recognize concepts in the text, and then use a rule base to map from identified concepts to categories.

Text classification systems which rely upon rule-base techniques also suffer from a number of drawbacks. The most significant drawback is that such systems require a significant amount of knowledge engineering to develop a working system appropriate for a desired text classification application. It becomes more difficult to develop an application using rule-based systems because all the requisite knowledge is placed into a rule base. By doing this, a knowledge engineer must spend a significant amount of time tuning and experimenting with the rules to arrive at the correct set of rules to ensure that the rules work together properly for the desired application.

Another shortcoming in the foregoing systems is that there is no built in mechanism to allow the knowledge base portion of a text classification system to learn from the input text over time to thereby increase system accuracy. The addition of a learning component to enhance the accuracy of a text classification system would be desirable to improve the performance of the system over time.

SUMMARY OF THE INVENTION

The present invention provides a method and system for performing text classification. Specifically, the system provides a core structure that performs text classification for external applications. It provides a core run time engine for executing text classification applications around which the knowledge needed to perform text classification can be built.

Generally, the operating environment of the present invention includes a general purpose computer system which comprises a central processing unit having memory, and associated peripheral equipment such as disk drives, tape drives and video display terminals. The system of the present invention resides in either memory or one of the storage devices. It is invoked by [illegible] running on the central processing unit to [illegible]. A knowledge base is maintained on the disk drive or some other storage medium in the computer system.

The method of classifying text according to the present invention begins upon acceptance of natural language input text which can be supplied by an external application. The input text is then parsed into a first list of recognized keywords which may include, e.g., words, phrases and regular expressions. The first list is used to deduce further facts from the natural language input text which [illegible] in classifying the input text. The deduced facts are then compiled into a second list. Then, utilizing the first list, the present invention calculates a numeric similarity score for each one of a plurality of categories in the knowledge base which indicates how similar one of the plurality of categories is to the recognized keywords in the first list. A dynamic threshold is then applied to determine which ones of the categories are most similar to the recognized keywords of the natural language input text. The result is a third list which includes the categories to which the recognized keywords are most similar. At this point, the text classification operation of the present invention is complete and the first, second and third lists can be passed on to the external application for application specific processing.

The architecture of the text classification system of the present invention comprises a natural language module, an intelligent inferencer module and a similarity measuring module. The natural language module extracts as much information as possible directly from natural language input text received by the text classification system from an external application. The intelligent inferencer module deduces any and all relevant information that is implicitly contained in the natural language input text. The similarity measuring module calculates a numeric similarity score for each one of the plurality of categories and applies a dynamic threshold
to the plurality of categories to ascertain which categories are potentially most similar to the natural language input text.

A domain specific knowledge base comprising a lexicon of keywords, a class hierarchy organization for keywords and a class hierarchy organization for categories is utilized by the text classification system of the present invention. The knowledge base is provided by the external application that is utilizing the text classification system of the present invention. By allowing keyword and category classes, the present invention simplifies the maintenance of the lexicon and an optional rule base. Accuracy is also improved by allowing multiple facts to be inferred from single keyword classes.

An optional category disambiguation module can be added to the system of the present invention to further refine the results obtained by the similarity measuring module. Under such circumstances, the domain specific knowledge base can be adapted to include the optional rule base. By making the category disambiguation module and the rule base optional, the present invention provides a text classification application developer more flexibility by allowing the developer to decide whether or not to include the rule base. While eliminating the category disambiguation module and the rule base may result in some loss of accuracy, the trade-off would be that development of an application is greatly simplified.

If, however, an application developer decides to utilize the category disambiguation module and the rule-base, the task is simple and straightforward because most of the processing and comparison of the input text is performed upstream in the architecture thereby greatly reducing the importance of the rule-base in the text classification process.

The system of the present invention can also be adapted to include an optional relevance feedback learning module as an add-on to the system of the present invention to learn over time to increase system accuracy. It can operate independently of the text classification system, e.g., in a batch mode. The relevance feedback learning module utilizes information passed to it by the system to adjust values stored in a category profile knowledge base. Such information may include a category determined most relevant to a given natural language input text, a category determined most relevant to the same natural language input text by an external source, e.g., a human expert, (the categories may or may not be the same), and a list of keywords that provide evidence for the categories selected along with the amount of evidence they provide.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary computer system for implementing a text classification system according to the present invention.

FIG. 2 illustrates an exemplary architecture of the modules utilized in a text classification system according to the present invention.

FIG. 3 shows the exemplary architecture illustrated in FIG. 2 together with an exemplary domain specific knowledge base.

FIG. 4 illustrates an exemplary portion of the keyword class hierarchy.

FIG. 5 illustrates an exemplary embodiment of the modules that comprise the intelligent inferencer module illustrated in FIG. 2.

FIG. 6 illustrates an exemplary implementation of the category disambiguation module illustrated in FIG. 2.

DETAILED DESCRIPTION

Referring now to the drawings, and initially to FIG. 1, there is illustrated an exemplary embodiment of a system for implementing the present invention. The system 10 comprises a computer 12 having a memory 22 associated therewith and with associated peripheral equipment such as a disk drive and storage unit 14, a tape drive 16 and a video display terminal 18. The computer 12 is generally any high performance computer such as a Digital Equipment Corporation VAX 6000-100. In conjunction with the computer 12, a domain specific knowledge base 20 that includes application-specific information is stored on the disk drive 14 and an application program 24 is stored in the memory 22.

Referring now to FIG. 2, there is illustrated an exemplary architecture for a text classification system 30 of the present invention. The system 30 comprises a natural language module 32, an intelligent inferencer module 34 and a similarity measuring module 36. An optional category disambiguation module 38 and an optional relevance feedback learning module 40 are also shown in FIG. 2. Modules 32, 34 and 36 (and 38, if selected to be part of the system) comprise what is hereinafter referred to as "the run time system" of the present invention. These modules are referred to as the run time system because collectively, they are invoked by the computer 12 (FIG. 1) to process and classify natural language text received from an external source, e.g., the application 24.

FIG. 3 illustrates the system 30 of FIG. 2 with the domain specific knowledge base 20 of FIG. 1. As illustrated in FIG. 1, the knowledge base 20 is shown as being stored on the disk drive 14. It should be understood that it could also be stored in the memory 22 (FIG. 1) or any other appropriate storage device coupled to the computer 12. The knowledge base 20 is external to the system 30. The information stored in the knowledge base 20 is provided by an applications programmer who is charged with developing the application 24 that is utilizing the system 30 to perform text classification functions. The modules which comprise the domain specific knowledge base 20 are a lexicon 52, a keyword class hierarchy 54, keyword/category profiles 56 and an optional category selection rule base 58 (utilized when the optional category disambiguation module 38 is used).

Each of the modules of the system 30 and the components of the knowledge base 20 are briefly discussed below.

The task of the natural language module 32 is to extract as much information as possible directly from natural language input text. The input text can be any machine-readable natural language text as determined by the external application 24. The natural language module 32 uses the lexicon 52, which comprises keywords which can include, for example, words, phrases, and regular expressions, to identify all recognized keywords in the natural language input text. Specifically, the module 32 extracts all the relevant information that is explicitly contained in the input text. The natural language module 32 passes a list of all the recognized keywords to the intelligent inferencer module 34.

An example of a natural language module of the type described above is disclosed in U.S. patent application






5,371,807 


Ser. No. 07/729,445, entitled "Method and Apparatus for Efficient Morphological Text Analysis using a High Level Language for Compact Specification of Inflectional Paradigms," (hereinafter "the Morphological Text Analysis patent application") filed Jul. 12, 1991 and assigned to Digital Equipment Corporation. This application is expressly incorporated herein by reference.

The list of recognized keywords passed to the intelligent inferencer module 34 is used to deduce any and all relevant information that is implicitly contained in the input text. To accomplish this task, the intelligent inferencer module 34 uses the keyword class hierarchy 54 to deduce further facts from the information explicitly stated in the input text. Keywords are grouped into classes in the keyword class hierarchy 54. Each class has associated facts that are true when a member of the class is identified in the input text.

For example, the input text may mention problems with a specific type of disk device but not explicitly mention that the problems are with a disk. The keyword class hierarchy 54 can include a class called "DISK DEVICES" with specific disks as members. The fact "(DEVICE TYPE = DISK)" can be attached to this class. When a specific disk device is identified, the fact "(DEVICE TYPE = DISK)" can be inferred even though the word "disk" was not explicitly mentioned in the input text. The intelligent inferencer module 34 also performs word substitutions in key phrases. The intelligent inferencer module 34 passes the list of recognized keywords and a list of all the extra facts that could be deduced from the recognized keywords to the similarity measuring module 36.

The list of recognized keywords extracted from the input text passed to the similarity measuring module 36 is used to calculate a numeric similarity score for each predefined category. Each score indicates how similar a given category is to the input text. The similarity measuring module 36 uses a knowledge base of keyword/category profiles 56 to determine the similarity score. Each category in the knowledge base of keyword/category profiles 56 has an associated profile. The profile tells the similarity measuring module 36 which keywords provide evidence for the given category. Associated with each keyword in a profile is a numeric weight called a "profile weight" that tells the similarity measuring module 36 the amount of evidence a keyword provides for the given category. The module 36 determines profile weights and combines the profile weights to arrive at similarity scores for all the categories. Once the similarity scores have been calculated, a dynamic threshold is applied to all of the categories defined in the domain specific knowledge base 20. Those categories whose similarity scores are below the threshold are discarded from consideration as being potentially most similar to the input text. The categories whose similarity scores are above the threshold are compiled into a list and are passed to the next module or directly to the external application 24 (not shown), along with the list of extracted keywords and the list of deduced facts, if there are any.

The list of most similar categories, the list of extracted keywords, and the list of deduced facts, if any, can then either be passed directly out to the external application 24, to the optional category disambiguation module 38 or to the optional relevance feedback learning module 40. If a rule base is desired for a particular application, the information is passed to the category disambiguation module 38. The module 38 uses the category selection rule base 58 to select certain categories over other categories based on the list of recognized keywords and the list of deduced facts. This module 38 further refines the list of the most similar categories and passes it, along with the list of recognized keywords and the list of deduced facts, to the external application 24 and, if desirable, the optional relevance feedback learning module 40.

The relevance feedback learning module 40 is an add-on to the run time system of the present invention. It can operate independently of the run time system, e.g., in a batch mode. The input to the relevance feedback learning module 40 comprises the category determined most relevant to a given input text, the category determined most relevant to the same input text by an external source, e.g., a human expert, (the categories may or may not be the same), and the list of recognized keywords that provide evidence for the categories selected along with the amount of evidence they provide. The module 40 then takes this information and adjusts the profile weights in the keyword/category profiles 56 accordingly.

The task that the natural language module 32 of the run time system of the present invention performs is to extract all the relevant information that is explicitly contained in a natural language input text. To accomplish this task, the module 32 uses the lexicon 52. The lexicon 52 contains all the information that is considered relevant for extraction purposes.

A brief description of the processing performed by the natural language module 32 is set forth below. For a complete description, the reader is referred to the Morphological Text Analysis patent application, referred to above, which is expressly incorporated herein by reference.

The natural language module 32 allows the inclusion of single word nouns and multiple word noun phrases into the lexicon 52. The natural language module 32 will recognize the root form of a noun or noun phrase as well as morphological variants of the root, e.g., plural form of the root noun or noun phrase. It also allows synonyms of a keyword to be entered into the lexicon 52 which are useful when defining keyword classes or when writing disambiguation rules.

Single word verbs can also be included in the lexicon 52. The root form of a verb must be entered into the lexicon 52. This way, the module 32 will not only recognize the root form, but morphological variants as well. For example, the verb "crash" in the lexicon 52 will identify "crashes", "crashing", and "crashed".

A limited form of multiple word verb phrases is allowed into the lexicon 52. In this case, a verb phrase is considered to be a single word verb combined with a single word noun or noun phrase subject/object (e.g., "Analyze Disk").

When keyword matching is performed for a verb phrase, each sentence in the input text is reviewed separately. For each sentence, the natural language module 32 tries to find the verb contained in the verb phrase. If the verb is found, it then looks to see if the noun or noun phrase contained in the verb phrase is present in the sentence. If both the verb and the noun phrase are found in the same sentence, then the entire verb phrase has been identified. For example, suppose the lexicon 52 contains the verb phrase "Analyze Disk" and one of the sentences in the input text that the present invention is parsing is the following: "I need help analyzing this damaged
disk." The natural language module 32 will first identify the single keywords "analyze" and "disk" (analyzing is a morphological variant of analyze). Then it will notice that "analyze" is the verb part of a verb phrase. It will then search the list of recognized keywords for that sentence for the noun part of the phrase (in this case the word "disk"). Since "disk" is in the keyword list, the present invention then identifies the verb phrase "Analyze Disk." The process works exactly the same way for multiple word noun phrases inside the verb phrase (e.g., "Analyze Process Dump," instead of "Analyze Disk").

The lexicon 52 can also include single word regular expressions. If a regular expression is in the lexicon 52, then the natural language module 32 will identify any word in the input text that matches against the regular expression. Being able to define regular expressions in the lexicon 52 gives the maintainer of the lexicon 52 more flexibility than being restricted to defining literal words and phrases. For example, the term "SYS$:a+" can be defined to match all the VMS (an operating system available from Digital Equipment Corporation) operating system service routines instead of having to enter the name of every operating system service routine directly into the lexicon 52.

Some of the syntax rules of the regular expressions allowed in the lexicon 52 are that an ordinary character matches that character; a period matches any character; a colon matches a class of characters described by the following character, e.g., ":a" matches any alphabetic, ":d" matches digits, ":n" matches alphanumerics; an expression followed by an asterisk matches zero or more occurrences of that expression, e.g., "fo*" matches "f", "fo", "foo", etc.; and an expression followed by a plus sign matches one or more occurrences of that expression, e.g., "fo+" matches "fo", "foo", etc.

The output of the natural language module 32 is a list which is a collection of sublists where each sublist corresponds to a single sentence in the input text and contains all the recognized keywords in that sentence. This list is passed to the intelligent inferencer module 34 for further analysis and possible augmentation as is described below.

The intelligent inferencer module 34 takes the information extracted directly from the input text by the natural language module 32 and attempts to add to that information by deducing further facts that are implied by the keywords identified. This module 34 uses the keyword class hierarchy 54. Each class in the keyword class hierarchy 54 contains a group of keywords (already defined in the lexicon 52) that share something in common. The classes are structured into a hierarchy such that classes themselves can be members of other classes. An exemplary portion of the keyword class hierarchy 54 is illustrated in FIG. 4.

What is useful about these classes is that facts can be attached to them to deduce implied information if a member of a class is found in the input text. If a keyword class member is identified, then all the facts attached to that class are inferred and added to the list of deduced facts. In addition, all the facts attached to the parent classes are inferred and added to the list of deduced facts as well.

In addition to inferring new facts with keyword classes, more general descriptions of an identified keyword can be substituted in an attempt to match other key phrases. This process is called "keyword substitution." It is an attempt to match key phrases in the lexicon 52 that could not be matched explicitly. For example, it may be desirable to match the phrase "Analyze Disk" every time "Analyze X" is detected, where X is a specific disk device. This is accomplished without having to enter a single verb phrase for every specific disk device into the lexicon 52, which would cause the maintenance of a lexicon to become problematic.

Using keyword substitution, a group of like devices can be grouped into a class and a word attached to the class to be used as a substitute for matching phrases in the lexicon 52. Going back to the example above, a class of disk devices can be defined and the keyword "disk" can be associated as a substitute. This way, "Analyze RD54" (where RD54 is a model number of a disk drive) can be recognized as "Analyze Disk" without having to have "Analyze RD54" stored in the lexicon 52.

The output of the intelligent inferencer module 34 is the list of all the extracted keywords and the list of all the deduced facts that the intelligent inferencer module 34 was able to infer. Associated with each extracted keyword is a number designating the frequency of the keyword in the input text.

An exemplary embodiment of the modules which comprise the intelligent inferencer module 34 is shown in FIG. 5. The left hand side of FIG. 5 shows the two main modules of the intelligent inferencer module 34, a fact inferencer module 60 and a keyword substitution module 62. The right hand side of FIG. 5 shows that both modules 60 and 62 use the keyword class hierarchy 54 (the same one illustrated in FIG. 3) as their knowledge base. The fact inferencer module 60 only utilizes the facts associated with the classes in the keyword class hierarchy 54 and the keyword substitution module 62 only uses the keyword substitutes associated with the classes in the keyword class hierarchy 54.

The fact inferencer module 60 follows a general method for attaching facts to keywords. This method, which is repeated for each keyword K, first searches the keyword class hierarchy 54 for all classes C of which the identified keyword is a member. Then, for each identified class C of which K is a member, all facts associated with C are added to a global list of deduced facts. The step of adding all facts associated with the identified class C is then applied recursively on all of the parent classes of C. By following this method, the fact inferencer module 60 adds facts to the list of deduced facts.

The keyword substitution module 62 similarly follows a general method for substituting keywords. This method, which is repeated for each keyword K, first searches the keyword class hierarchy 54 for all classes C of which K is a member. Then, for each identified class C of which K is a member, all the substitution keywords S associated with C are retrieved. Then, S is substituted for K and an attempt is made to match verb phrases in the lexicon 52. If a match is found, it is added to a global list of identified keywords. Then, the steps of retrieving substitution keywords and substituting keywords are recursively applied on all of the parent classes of C.

The similarity measuring module 36 is responsible for returning a numeric similarity score for each category in the keyword/category profile 56. Each score indicates how similar a given category is to the recognized keywords extracted from the natural language input text. The similarity measuring module 36 uses the knowledge base of keyword/category profiles 56 to determine similarity scores for all of the categories defined. Each category in the keyword/category pro-
files 56 has its own profile containing the keywords that are relevant to that category. Once the input text is parsed by the natural language module 32 and the intelligent inferencer module 34, a list of all the keywords present in the input text, as well as the number of times they occur in the input text, called term frequency, is assembled by the similarity measuring module 36. The category profile can be represented as an n-dimensional vector of the form C=(c1, c2, . . . , cn), where n equals the total number of possible keywords in the lexicon 52 and the individual element "ci" represents the corresponding profile weight of keyword "i" in the category profile. The input text can also be represented as an n-dimensional vector of the form T=(t1, t2, . . . , tn), where n is as above and "ti" represents the corresponding weight of keyword "i" in the input text. Similarity between a category and an input text can then be measured as the inner product between these corresponding vectors, which is defined as:

Sim(C,T)=SUM(i=1,n) (ci * ti)

The size of n can vary depending on the size of the keyword lexicon 52.

The similarity measuring module 36 includes a method for efficiently computing the inner product similarity measure so that when n becomes large the similarity measures can still be quickly calculated. The method assumes that each keyword in the lexicon 52 has a corresponding vector of categories that it provides evidence for and a profile weight for each category. This information can be quickly computed from the category profile vectors described above. This is accomplished by first initializing all similarity scores for all categories to zero. Then, for each keyword i identified in the input text and for each category j in the category vector of the keyword i, the keyword weight of keyword i is multiplied by the profile weight of category j. Then, the resulting product is added to the similarity score for the category j.

The foregoing method insures that only the identified keywords and the categories they provide evidence for are being multiplied together. All the other portions of the inner products will equal zero anyway since the keyword weights will be zero (i.e., the keywords were

word occurs. The profile weight calculation formula is as follows:

PW=log(CAT/CF)

where CAT equals the total number of defined categories and CF equals the collection frequency of the given keyword (this formula uses only collection frequency). Note that as CF increases, the profile weight decreases. This makes sense because if a keyword provides evidence for a large number of categories then its profile weight should be lower than a keyword that provides evidence for a small number of categories.

The keyword weight calculation formula is as follows:

KW=(TF * log(CAT/CF))/CKW

where CAT and CF are as above, TF equals the term frequency of the keyword in the input text, and CKW is the combined keyword weight and is calculated as follows:

CKW=SQUARE_ROOT(SUM(i=1,n) (SQUARE(tfi * log(CAT/cfi))))

where n is the total number of keywords found in the input text, "tfi" and "cfi" are the term and collection frequencies for one of the found keywords, and CAT is as previously defined.

Once similarity scores have been calculated for all categories, the similarity measuring module 36 applies a dynamic threshold to the list of categories. This threshold is a given tuneable offset from the similarity score of the most similar category. In other words, if N is the highest similarity score for the input text and M is the pre-defined threshold offset, then N - M is the threshold value. All categories whose similarity scores are below the threshold value are discarded and those above the threshold value are compiled into a list and passed to the next module, along with the list of recognized keywords and the list of deduced facts.

As described above, the foregoing results can be passed directly to the external application 24, to the relevance feedback learning module 40 or the category
not identified in the input text). The run time perfor- disambiguation module 38. If the information is passed 
mance of this method is significantly better than per- to optional category disambiguation module 38, it 
forming a straight summation of the products of the uses * rulejbase tp,sele ct certain cate gories over_othgr 
vector elements because of the large number of ele- 50 categ ories based on the list of re^ gnized^eywords^and 
ments equaling zero in the vectors. the list o f deduced facts . Rules are utilized to decide tthe 

Like keywords, categories can also be grouped into appropriate ca tegory whenhig^phan~one^^ 
hierarchically structured classes. This feature allows a poten ^"candidate for beingttie mosTsSuar.>ine left 
lexicon maintainer to define category class profiles as hand sides of the rules consist of CATEGORY and 
well as category profiles. The run time system of the 55 KEYWORD slot-value pairs and deduced facts. The 
present invention automatically translates category right hand sides of the rule merely assert a preselected 
class profiles into individual category profiles and in- preference for one category over another category (or 
corporates them into existing category profiles. Cate- set of categories). 

gory classes are also useful when writing disambigua- An example of a rule that could be used by the cate- 
tion rules. By having category classes, a single rule can 60 gory disambiguation module 38 is set forth below, 
operate on an entire class rather than writing individual 

rules for each category in a class. _ (category vm fuje^systemi 

The initial weights for category profiles and keyword [category = tx =^cfD^T-VAX 

weights for input texts are ascertained by formulae used vms-tape)) 
by the similarity measuring module 36 that uses both 65 (device type =» disk) 

term frequency and collection frequency as input In SS^^i^SffrEM o vm ?x 

text classification terms, collection frequency is the - THEN " prefer VMS-FiLEnSYSTEM over tx 

number of category profiles in which a specific key- 
This rule states that if VMS-FILE-SYSTEM and
either DECNET-VAX or VMS-TAPE are potentially
most similar categories, and if the fact
(DEVICE-TYPE=DISK) was deduced by the intelligent inferencer
module 34, and if the keyword ACP has been found (or
one of its synonyms); then the category
VMS-FILE-SYSTEM will be preferred over either
DECNET-VAX or VMS-TAPE.

When the category disambiguation module 38 is
invoked, all the rules that can apply to the given input text
are fired and all the category preferences are recorded
by the category disambiguation module 38. As a result
of the firing of the rules, the list of categories whose
similarity scores are above the threshold value is
modified to include only the most similar categories that do
not have any other category with preference over them.
This list, along with the list of recognized keywords and
the list of deduced facts, is then passed to the
application 24 (and to the relevance feedback learning module
40).

As described above, the category disambiguation
module 38 is detachable from the run time architecture
of the present invention. If a particular text
classification application has no heuristics for category selection,
then the category disambiguation module 38 can be
bypassed and reliance can be placed solely on similarity
scores calculated by the similarity measuring module
36, to determine the most similar category. Detaching
the rule base will most likely result in a decrease in the
accuracy of the classification; but for some applications
no such rule base exists. By making the rule base
detachable, the range of potential applications that can be
developed using the present invention is increased.

An exemplary implementation of the category
disambiguation module 38 is illustrated in FIG. 6. The top
portion of FIG. 6 shows the compile time processing
needed to translate the category selection rule base 58
into the proper syntax for a run time inference engine,
such as "CLIPS," which is a public domain inference
engine developed by NASA. A rule compiler 68 takes
as input a category selection rule base 64 and category
class hierarchy 66. At run time, all the recognized
keywords, facts, and most similar categories (that are given
as input to the category disambiguation module) are
translated into CLIPS facts 72 and are given as input
(along with a CLIPS rule base 70) to the CLIPS
inference engine 74. The CLIPS inference engine 74 fires as
many rules as it can against the given facts. Each rule
firing returns a category preference. Once all the rules
that can fire have fired, then all the category
preferences are collected and used by the present invention to
come up with a final list of most similar categories (as
described above).

Once text classification is done and the information is
passed to the external application 24 for further
application specific processing, the optional relevance
feedback learning module 40 can be invoked to adjust the
keyword/category profile weights to achieve better
accuracy. The module 40 collects all the text
classifications over a predetermined period from either the
similarity measuring module 36 or the optional category
disambiguation module 38, whichever is the last module
of the run time system. The classifications include the
input text, the chosen most similar category, and the
keyword weights for the extracted keywords. Then, the
relevance feedback learning module 40 performs the
following tasks for each category profile in the
keyword/category profiles 56. First, all the text
classifications where the particular category was identified as the
most similar are collected. Next, the input texts which
were correctly classified and which were not are
determined. Then, the keyword weights for all the correctly
classified input texts are added to the corresponding
keyword profile weights in the category profile.
Finally, the keyword weights for all the incorrectly
classified input texts are subtracted from the corresponding
keyword profile weights in the category profile. Also,
the correct category is determined and the keyword
weights are added to the profile of that category.

An example of an application that may use the text
classification system of the present invention is routing of
customer service requests within a customer support
center. Without an automated text classifier, human call
screeners interact with a call handling system and
determine the appropriate group to send a customer service
request. A call handling system records all the pertinent
information that a support specialist needs to solve the
customer problem. With an automated text classifier,
the call handling system can automatically invoke the
text classification system of the present invention to
determine where to route the customer service request
without human intervention.

The following section provides an example of a call
handling application for a given customer service
request and shows how the individual modules of the text
classification system of the present invention operate on
the customer service request. The output of the text
classification system enables the application to route the
customer service request to the appropriate group.
Although not shown here because it is application specific
processing, the call handling system would take this
output and automatically send the customer service
request to the identified support group.

Set forth below is an explanation of the processing
performed by the system of the present invention using
an example natural language text input. The example
text input is:

"While trying to backup my database to a TK70, the
process died with the error AUDISABLED and
produced a dump file. I need help analyzing the
dump file and getting the backup to work."

This input text is passed to the run time system by the
external application 24 in machine readable form. The
following explanation demonstrates how the present
invention processes this input text.

As discussed above, the processing begins with the
natural language module 32. The natural language
module 32 utilizes the lexicon 52 to recognize words or
phrases in the natural language input text. Set forth
below is an example of entries in the lexicon 52. Each
entry in the lexicon 52 has a corresponding identifier
which defines the entry type. For example,
"BACKUP" is identified as a verb and a noun in
separate entries.

BACKUP          VERB
BACKUP          NOUN
DATABASE        NOUN
AUDISABLED      NOUN
DUMP FILE       NOUN PHRASE
ANALYZE DUMP    VERB PHRASE
TK[0-9]+        REGULAR EXPRESSION

Given the entries in the lexicon 52, the natural
language module 32 identifies the following keywords and
phrases from the given natural language input text:
backup (twice, once as a verb and once as a noun),
database, AUDISABLED, TK70 (matched against the 
regular expression), dump file, and analyze dump. Two 
interesting events happen in the recognition of the verb 
phrase, "analyze dump." The first is that a morphologi- 
cal variant of the verb "analyze" is identified ("analyz- 
ing" being the morphological variant). The second is 
that the phrase was recognized as a single unit even 
though the two words that comprise it were not contiguous
in the input text. As previously described, the
natural language module 32 identifies the verb portion 
of the verb phrase and then looks for an occurrence of 
the noun portion in the same sentence. In this case it was 
successful, so the entire verb phrase matches. The list 
that the natural language module 32 outputs as a data 
structure to the intelligent inferencer module 34 as the 
result of parsing this input text would look something 
like this: 



((S1: ("BACKUP" 1) ("DATABASE" 1)
      ("TK-DEVICE" 1)
      ("AUDISABLED" 1) ("DUMP FILE" 1))
 (S2: ("ANALYZE DUMP" 1) ("BACKUP" 1))).

The numbers with each identified keyword represent
the frequency of the keyword in the given sentence.
This data structure is passed as input to the intelligent
inferencer module 34.

The intelligent inferencer module 34 uses class
information to deduce further information from the input
text. For the purposes of this example, the following
classes are defined to reside in the keyword class
hierarchy 54:

Keyword Class TAPE-DEVICES, which is a grouping
of all specific tape devices and includes
"TK70",

Keyword Class RDB-ERROR-MSGS, which is a
grouping of all the error messages generated by the
product RDB, and includes "AUDISABLED",

Keyword Class ERROR-MSGS, which is a grouping
of all possible error messages and includes the
keyword class RDB-ERROR-MSGS,

Category Class VIA-PRODUCTS, which is a grouping
of all the VIA products, including RDB (RDB
is a category in the domain-specific knowledge
base).

The following facts are associated with the above
classes in the keyword class hierarchy 54:

(CLASS = TAPE-DEVICE)      --> (DEVICE-TYPE = TAPE),
(CLASS = RDB-ERROR-MSGS)   --> (LAYERED-PROD = RDB),
(CLASS = ERROR-MSGS)       --> (ERROR-MESSAGE = $keyword).

Given these classes and associated facts, it can be
deduced that the DEVICE-TYPE is TAPE because of
the identification of TK70. A potential layered product
is RDB because of the identification of AUDISABLED
as a RDB error message. An error message
found in this input text is AUDISABLED.

The list of recognized keywords and the list of
deduced facts output by the intelligent inferencer module
34 as a data structure would look something like this:

((KEYWORDS: ("BACKUP" 2) ("DATABASE" 1)
            ("TK-DEVICE" 1) ("AUDISABLED" 1)
            ("DUMP FILE" 1) ("ANALYZE DUMP" 1))
 (FACTS: (DEVICE-TYPE: TAPE) (LAYERED-PROD: RDB)
         (ERROR-MESSAGE: AUDISABLED))).

Notice that the keywords are now not separated into
sentence groupings and that the information deduced
by the intelligent inferencer module 34 is incorporated
into the data structure. This data structure is passed as
input to the similarity measuring module 36.

The similarity measuring module 36 calculates a
similarity measure for every category in the
keyword/category profiles 56 against the identified keywords in the
data structure. At this point, the frequency numbers
associated with each keyword will be replaced by their
term weights by using the term weighting formulae
previously described. To keep things simple in this
example, it is assumed that the term weights remain as
they are above. For this example, the following
category profiles are contained in the keyword/category
profiles 56:

Category BACKUP     has the associated keyword "BACKUP"
Category RDB        has the associated keywords "DATABASE"
                    and "AUDISABLED"
Category DBMS       has the associated keyword "DATABASE"
Category TAPE       has the associated keyword "TK-DEVICE"
Category BUGCHECK   has the associated keywords "DUMP FILE"
                    and "ANALYZE DUMP"



It is assumed for this example that each keyword for 
each category has a weight of 1. It is also assumed that 
there are other categories in the knowledge base, but 
that none of them have any keywords in their profiles 
that match the keywords found in the input text. Also,
it should be understood that the categories above have 
other keywords in their profiles, but for simplicity, only 
the keywords that match keywords found in the input 
text are presented. The similarity measures for the cate- 
gories above are then as follows: 



Sim(T, BACKUP) = 2 (because "BACKUP" has a keyword
weight of 2)
Sim(T, RDB) = 2
Sim(T, DBMS) = 1
Sim(T, TAPE) = 1
Sim(T, BUGCHECK) = 2
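
The scores listed above can be reproduced with a short sketch of the keyword-by-keyword accumulation described earlier for the similarity measuring module 36. This is an illustration only, not the patented implementation: the function and variable names are hypothetical, the logarithm in the weighting formulae is read as base 2 (an assumption), and the worked example skips the weighting step, as the text does, by using raw frequencies and profile weights of 1.

```python
import math
from collections import defaultdict


def keyword_weights(term_freqs, collection_freqs, num_categories):
    """KW = TF * log2(CAT / CF) / CKW, where CKW is the Euclidean norm
    of the unnormalized weights. Base-2 logarithm is an assumption."""
    raw = {
        kw: tf * math.log2(num_categories / collection_freqs[kw])
        for kw, tf in term_freqs.items()
    }
    ckw = math.sqrt(sum(w * w for w in raw.values()))
    if ckw == 0.0:  # every keyword occurs in every category profile
        return {kw: 0.0 for kw in raw}
    return {kw: w / ckw for kw, w in raw.items()}


def invert_profiles(profiles):
    """Turn {category: {keyword: weight}} into {keyword: {category: weight}}."""
    index = defaultdict(dict)
    for category, profile in profiles.items():
        for kw, pw in profile.items():
            index[kw][category] = pw
    return index


def similarity_scores(text_weights, keyword_index):
    """Accumulate the inner product one identified keyword at a time.
    Categories that are never touched keep an implicit score of zero."""
    scores = defaultdict(float)
    for kw, weight in text_weights.items():
        for category, pw in keyword_index.get(kw, {}).items():
            scores[category] += weight * pw
    return dict(scores)


# Worked example: every profile weight is 1, term weights are raw counts.
PROFILES = {
    "BACKUP": {"BACKUP": 1.0},
    "RDB": {"DATABASE": 1.0, "AUDISABLED": 1.0},
    "DBMS": {"DATABASE": 1.0},
    "TAPE": {"TK-DEVICE": 1.0},
    "BUGCHECK": {"DUMP FILE": 1.0, "ANALYZE DUMP": 1.0},
}
TEXT_WEIGHTS = {
    "BACKUP": 2, "DATABASE": 1, "TK-DEVICE": 1,
    "AUDISABLED": 1, "DUMP FILE": 1, "ANALYZE DUMP": 1,
}
SCORES = similarity_scores(TEXT_WEIGHTS, invert_profiles(PROFILES))
```

Only keywords actually found in the input text are visited, so the cost depends on the number of recognized keywords and the categories they index rather than on the size of the lexicon.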



For this example, a category threshold offset of 0.5 is
chosen. This means that only the categories with similarity
measures above 1.5 (2 - 0.5) will pass on to the
next module. The list of the most similar categories,
along with the list of recognized keywords and the list
of deduced facts, that the similarity measuring module
36 outputs as a data structure would look something like
this:



((KEYWORDS: ("BACKUP" 2) ("DATABASE" 1)
            ("TK-DEVICE" 1) ("AUDISABLED" 1)
            ("DUMP FILE" 1) ("ANALYZE DUMP" 1))
 (FACTS: (DEVICE-TYPE: TAPE) (LAYERED-PROD: RDB)
         (ERROR-MESSAGE: AUDISABLED))
 (CATEGORIES: (BACKUP 2) (RDB 2) (BUGCHECK 2))).
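
The dynamic threshold step that produces the CATEGORIES list can be sketched as follows. The function name is hypothetical, and keeping a category whose score exactly equals the threshold is an assumption; the text only says that scores below the threshold are discarded.

```python
def apply_dynamic_threshold(scores, offset):
    """Keep categories within `offset` of the best similarity score.

    scores: {category: similarity score}
    offset: the pre-defined threshold offset (M in the description)
    """
    if not scores:
        return {}
    threshold = max(scores.values()) - offset  # N - M
    # Assumption: a score equal to the threshold is kept, not discarded.
    return {cat: s for cat, s in scores.items() if s >= threshold}


EXAMPLE_SCORES = {"BACKUP": 2, "RDB": 2, "DBMS": 1, "TAPE": 1, "BUGCHECK": 2}
MOST_SIMILAR = apply_dynamic_threshold(EXAMPLE_SCORES, offset=0.5)
```

Because the cut-off moves with the best score, a weakly matching input still yields its closest categories, which a fixed absolute threshold would not guarantee.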



To continue the example, a rule base is selected. 
There are two rules in the category selection rule base 
58 as follows: 






IF (KEYWORD = "BACKUP") and
   (LAYERED-PROD = VIA-PRODUCTS)
THEN (PREFER VIA-PRODUCTS OVER BACKUP)

IF (SKILL = BUGCHECK) and
   (LAYERED-PROD = VIA-PRODUCTS) and
   (NOT (EXISTS BUGCHECK-TYPE))
THEN (PREFER VIA-PRODUCTS OVER BUGCHECK)



These two rules use the class VIA-PRODUCTS,
which was defined previously as including the category
RDB. Since the fact (LAYERED-PROD = RDB) is
present, both of these rules will fire, the result being that
the RDB category is preferred over both BACKUP and
BUGCHECK. The final data structure output by the
category disambiguation module 38 is as follows:



((KEYWORDS: ("BACKUP" 2) ("DATABASE" 1)
            ("TK-DEVICE" 1) ("AUDISABLED" 1)
            ("DUMP FILE" 1) ("ANALYZE DUMP" 1))
 (FACTS: (DEVICE-TYPE: TAPE) (LAYERED-PROD: RDB)
         (ERROR-MESSAGE: AUDISABLED))
 (CATEGORIES: (RDB 2))
 (PREFERENCES: (RDB OVER BACKUP RULE-1)
               (RDB OVER BUGCHECK RULE-2))).



The rule numbers are listed with the preferences so
they can be accessed at run time to generate explanations
to the user as to why the rules fired.
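
How the recorded preferences narrow the category list can be illustrated with a small sketch. It stands in for the CLIPS-based rule engine rather than reproducing it: the rule conditions are assumed to have already matched, the names are hypothetical, and a preference is assumed to count only when both categories survived the threshold.

```python
# Category class hierarchy from the example (assumed flat for brevity).
CATEGORY_CLASSES = {"VIA-PRODUCTS": {"RDB"}}


def members(name):
    """Expand a category class to its member categories; a plain category is itself."""
    return CATEGORY_CLASSES.get(name, {name})


def resolve_preferences(candidates, fired_rules):
    """Turn fired (preferred, over, rule_id) triples into category-level preferences."""
    preferences = []
    for preferred, over, rule_id in fired_rules:
        for winner in sorted(members(preferred)):
            for loser in sorted(members(over)):
                # Assumption: both sides must be among the candidate categories.
                if winner in candidates and loser in candidates and winner != loser:
                    preferences.append((winner, loser, rule_id))
    return preferences


def apply_preferences(candidates, preferences):
    """Keep only categories that no other category is preferred over."""
    dominated = {loser for _, loser, _ in preferences}
    return {cat: s for cat, s in candidates.items() if cat not in dominated}


CANDIDATES = {"BACKUP": 2, "RDB": 2, "BUGCHECK": 2}
FIRED = [("VIA-PRODUCTS", "BACKUP", "RULE-1"),
         ("VIA-PRODUCTS", "BUGCHECK", "RULE-2")]
PREFERENCES = resolve_preferences(CANDIDATES, FIRED)
FINAL = apply_preferences(CANDIDATES, PREFERENCES)
```

Carrying the rule identifier in each preference triple is what makes the later explanation step possible.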

Once a text classification operation is performed,
control is returned to the invoking application, in this
case a call handling system. The call handling system
will then use the classification to route the customer
service request to the appropriate support group. The
call handling system can also store the service request
and its classification for later use by the relevance
feedback learning module 40 (FIG. 2). After a given
predetermined period of time, the call handling system
collects all the service requests and their classifications and
passes them as input to the relevance feedback learning
module 40. The relevance feedback learning module 40
takes these classified requests and interacts with a
human call routing expert via video display terminal 18
(FIG. 1) to determine which ones were correctly and
which ones were incorrectly routed. This learning
module would then take the information from the call
routing expert and adjust the profile weights in the
keyword/category profiles 56 as previously described.
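
The weight adjustment performed by the relevance feedback learning module 40 can be sketched as below. The function name and the record layout (chosen category, expert-confirmed category, keyword weights) are assumptions made for illustration, as is treating a keyword missing from a profile as having weight zero.

```python
import copy


def apply_feedback(profiles, classifications):
    """Return adjusted profiles; the input profiles are left untouched.

    profiles:        {category: {keyword: profile weight}}
    classifications: iterable of (chosen, correct, keyword_weights), where
                     `correct` is the category confirmed by the routing expert.
    """
    updated = copy.deepcopy(profiles)
    for chosen, correct, kw_weights in classifications:
        # Reinforce a correct choice, penalize an incorrect one.
        sign = 1.0 if chosen == correct else -1.0
        chosen_profile = updated.setdefault(chosen, {})
        for kw, w in kw_weights.items():
            chosen_profile[kw] = chosen_profile.get(kw, 0.0) + sign * w
        if chosen != correct:
            # The category that should have won also gains the keyword weights.
            correct_profile = updated.setdefault(correct, {})
            for kw, w in kw_weights.items():
                correct_profile[kw] = correct_profile.get(kw, 0.0) + w
    return updated


OLD_PROFILES = {"BACKUP": {"BACKUP": 1.0}, "RDB": {"DATABASE": 1.0}}
FEEDBACK = [
    ("BACKUP", "RDB", {"BACKUP": 0.5, "DATABASE": 0.5}),  # misrouted request
    ("RDB", "RDB", {"DATABASE": 0.25}),                    # correctly routed
]
NEW_PROFILES = apply_feedback(OLD_PROFILES, FEEDBACK)
```

Returning a copy rather than mutating in place lets a maintainer compare old and new profiles before committing a batch of feedback.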

What is claimed is:

1. A method for classifying natural language text 
input into a computer system, the system includes mem- 
ory having a domain specific knowledge base having a 
plurality of categories stored therein, the method comprising
the steps of:

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 



(c) using the first list to deduce further facts from the 
natural language input text; 

(d) compiling the deduced facts into a second list; 

(e) calculating a numeric similarity score for each one 
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which
ones of the plurality of categories are most similar
to the recognized keywords of the first list, comprising
the sub-steps of:

(I) calculating a value for the dynamic threshold
based upon a similarity score of a most similar
category and a predefined threshold offset, and

(II) classifying the categories based upon their
respective similarity scores by discarding categories
whose similarity scores are below the
threshold value;

(g) compiling the ones of the plurality of categories
determined to be most similar in step (f) into a third
list; and

(i) passing the first list, the second list and the third 
list to an external application. 

2. The method according to claim 1 wherein the 
keywords comprise words, phrases and regular expres- 
sions. 

3. The method according to claim 1 wherein the 
knowledge base includes a keyword class hierarchy 
structured such that keywords that share something in 
common are grouped into classes, each class has associ- 
ated facts that are true when a member of the class is 
identified in the natural language input text, wherein the 
steps of using the first list to deduce further facts from 
the natural language input text and compiling the de- 
duced facts into a second list further are performed by 
the steps of: 

(a) searching the keyword class hierarchy to deter- 
mine if a keyword identified in the first list is a 
member of a class in the keyword class hierarchy; 

(b) when a keyword identified in the first list is a 
member of a class, 

(i) inferring all the facts attached to that class by 
adding them to the second list, and 

(ii) adding all the facts attached to all classes above 
the classes of which the identified keyword is a 
member in the keyword class hierarchy to the 
second list; and

(c) repeating steps (a) through (b) for each keyword 
in the first list.

4. The method according to claim 2 wherein the 
knowledge base includes a keyword class hierarchy 
structured such that keywords that share something in 
common are grouped into classes, each class has associ- 
ated facts that are true when a member of the class is 
identified in the natural language input text, wherein the 
step of using the first list to deduce further facts from 
the natural language input text further comprises the 
step of substituting general descriptions of an identified 
keyword in the first list in an attempt to match other 
phrases that could not be matched explicitly so that a 
group of similar keywords can be grouped into a class 
and a word can be attached to the class to be used as a 
substitute for matching phrases. 

5. The method according to claim 1 wherein the 
knowledge base includes a keyword class hierarchy 
structured such that keywords that share something in 
common are grouped into classes, each class has associ- 
ated facts that are true when a member of the class is 
identified in the natural language input text, wherein the
steps of using the first list to deduce further facts from
the natural language input text and compiling the deduced
facts into a second list further are performed by
the steps of:

(a) searching the keyword class hierarchy for all
classes of which an identified keyword in the first
list is a member;

(b) adding all facts associated with each one of the
classes of which the identified keyword is a member
to a global list of deduced facts;

(c) recursively applying step (b) on all classes above
the classes of which the identified keyword is a
member in the keyword class hierarchy; and

(d) repeating steps (a) through (c) for each keyword
in the first list.

6. The method according to claim 1 wherein the
knowledge base includes a lexicon that includes words,
phrases and expressions, and a keyword class hierarchy
structured such that keywords that share something in
common are grouped into classes, each class has associated
facts that are true when a member of the class is
identified in the natural language input text, wherein the
step of using the first list to deduce further facts from
the natural language input text further comprises the
steps of:

(a) searching the keyword class hierarchy for all
classes of which an identified keyword in the first
list is a member;

(b) locating all substitution keywords associated with
each class of which the identified keyword is a
member;

(c) retrieving the located substitution keywords;

(d) substituting the located substitution keywords for
the identified keyword;

(e) using the located substitution keywords to identify
matches between the located substitution keywords
and phrases in the lexicon;

(f) recursively applying steps (b) through (e) on all
classes above the classes of which the identified
keyword is a member in the keyword class hierarchy;
and

(g) repeating steps (a) through (f) for each keyword in
the first list.

7. A text classification system comprising:
memory;

a domain specific knowledge base stored in said memory
having a plurality of categories, the domain
specific knowledge base includes a knowledge base
of keyword/category profiles, each category in the
keyword/category profiles knowledge base having
an associated profile which indicates what information
provides evidence for a given category, the
keyword/profile weight knowledge base arranged
to have associated with each keyword in a profile a
profile weight that represents the amount of evidence
a keyword provides for a given category;
and

a computer coupled to the memory, the computer
including:
a natural language module for accepting as input
into the computer natural language input text,
the natural language module includes means for
parsing the natural language input text into a first
list of recognized keywords;
an intelligent inferencer module for using the first
list to deduce further facts from the information
explicitly stated in the natural language input
text, the intelligent inferencer module includes
means for compiling the deduced facts into a
second list;

a similarity measuring module for calculating a 
numeric similarity score for each one of the plu- 
rality of categories in the knowledge base to 
indicate how similar one of the plurality of cate- 
gories is to the natural language input text, the 
similarity measuring module includes: 
means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords 
of the natural language input text, and 
means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list; and 
a relevance feedback learning module for adjusting 
the profile weights in the keyword/category 
profiles in the domain specific knowledge base 
based upon the ones of the plurality of categories 
determined most relevant to the natural language 
input text by the similarity measuring module 
and a second ones of the plurality of categories 
determined most relevant to the natural language 
input text by an external source. 

8. A method for classifying natural language text 
input into a computer system, the system includes mem- 
ory having a domain specific knowledge base having a 
plurality of categories stored therein, the method com- 
prising the steps of: 

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 

(c) using the first list to deduce further facts from the 
natural language input text; 

(d) compiling the deduced facts into a second list; 

(e) calculating a numeric similarity score for each one 
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which 
ones of the plurality of categories are most similar 
to the recognized keywords of the first list, the step 
of applying a dynamic threshold further compris- 
ing the sub-steps of: 

(1) calculating a value for the dynamic threshold 
based upon a similarity score of a most similar 
category and a predefined threshold offset, and 

(2) classifying the categories based upon their re- 
spective similarity scores by discarding catego- 
ries whose similarity scores are below the thresh- 
old value; and 

(g) compiling the ones of the plurality of categories 
determined to be most similar in step (f) into a third 
list.

9. The method according to claim 1 wherein the 
domain specific knowledge base further includes a rule 
base, the method further comprising the steps of: 

(a) utilizing the rule base to select certain ones of the 
plurality of categories determined to be most simi- 
lar to the recognized keywords over other ones of 
the plurality of categories based on the first and 
second lists; and 

(b) modifying the third list of the most similar catego- 
ries to include the certain ones of the plurality of 
categories selected. 

10. The method according to claim 1 wherein the 
domain specific knowledge base includes a knowledge 
base of keyword/category profiles, each category in the
keyword/category profiles knowledge base having an
associated profile which indicates what information
provides evidence for a given category, the
keyword/profile weight knowledge base is arranged to have
associated with each keyword in a profile a profile weight
that represents the amount of evidence a keyword
provides for a given category, the method further comprising
the step of adjusting the profile weights in the
keyword/category profiles in the domain specific knowledge
base based upon the ones of the plurality of categories
determined most relevant to the natural language
input text and a second ones of the plurality of categories
determined most relevant to the natural language
input text by an external source.

11. A method for routing customer service requests 
by a computer system in a customer support center 
which includes support groups to service customer 
requests, the computer system including a call handling 
system, a text classification system and memory having 
a domain specific knowledge base having a plurality of 
categories stored therein representative of the support 
groups within the customer support center, each sup- 
port group being identified by a name, the method comprising
the steps of:

(a) receiving a customer service request by the computer
system from the call handling system;

(b) passing the customer service request to the text
classification system to determine where to route
the customer service request within the customer
support center;

(c) parsing the customer service request into a first list
of recognized keywords;

(d) using the first list to deduce further facts from the
customer service request;

(e) compiling the deduced facts into a second list;

(f) calculating a numeric similarity score for each one
of the plurality of categories in the knowledge base
to indicate how similar each one of the plurality of
categories is to the customer service request;

(g) applying a dynamic threshold to identify which
one of the support groups should handle the customer
service request by determining which ones
of the plurality of categories are most similar to the
recognized keywords of the customer service request;

(h) compiling the ones of the plurality of categories
determined to be most similar in step (g) into a third
list;

(i) passing the first list, the second list and the third 
list back to the call handling system; and 

(j) routing the customer service request to the identi- 
fied one of the support groups. 

12. A method for routing customer service requests
by a computer system in a customer support center
which includes support groups to service customer
requests, the computer system including a call handling
system, a text classification system and memory having
a domain specific knowledge base having a plurality of
categories stored therein representative of the support
groups within the customer support center, each support
group being identified by a name, and a rule base,
the method comprising the steps of:

(a) receiving a customer service request by the com-
puter system from the call handling system; 

(b) passing the customer service request to the text
classification system to determine where to route
the customer service request within the customer
support center;

(c) parsing the customer service request into a first list 
of recognized keywords; 

(d) using the first list to deduce further facts from the 
customer service request; 

(e) compiling the deduced facts into a second list; 

(f) calculating, utilizing the first list, a numeric simi- 
larity score for each one of the plurality of catego- 
ries in the knowledge base to indicate how similar 
each one of the plurality of categories is to the 
customer service request; 

(g) applying a dynamic threshold to identify which 
support groups should handle the customer service 
request by determining which ones of the plurality 
of categories are most similar to the recognized 
keywords of the customer service request; 

(h) compiling the ones of the plurality of categories 
determined to be most similar in step (g) into a third 
list; 

(i) utilizing the rule base to select certain ones of the 
plurality of categories determined to be most simi- 
lar to the recognized keywords over other ones of 
the plurality of categories based on the first and 
second lists; 

(j) modifying the third list of the most similar catego- 
ries to include the certain ones of the plurality of 
categories selected; 

(k) passing the first list, the second list and the third 
list back to the call handling system; and 

(l) routing the customer service request to the selected
one of the support groups.

13. The method according to claim 11 or 12 wherein 
the domain specific knowledge base includes a knowl- 
edge base of keyword/category profiles, each category 
in the keyword/category profiles knowledge base hav- 
ing an associated profile which indicates what informa- 
tion provides evidence for a given category, the key- 
word/profile weight knowledge base is arranged to 
have associated with each keyword in a profile a profile 
weight that represents the amount of evidence a key- 
word provides for a given category, the method further 
comprising the step of adjusting the profile weights in 
the keyword/category profiles in the domain specific 
knowledge base based upon the one of the support 
groups selected to handle the customer service request 
and a second one of the support groups determined 
most relevant to the natural language input text by an 
external source. 
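
The claim above does not state how the profile weights are adjusted. As a purely illustrative sketch, the following Python assumes a simple fixed-step rule: when the category selected by the system differs from the category supplied by the external source, the weights of the recognized keywords are lowered in the selected category's profile and raised in the correct one. The function name `adjust_profile_weights`, the `step` parameter, and the sample profiles are all hypothetical and do not come from the patent.

```python
def adjust_profile_weights(profiles, keywords, selected, correct, step=0.5):
    """Nudge keyword weights after feedback from an external source.

    profiles: dict mapping category name -> {keyword: profile weight}
    keywords: recognized keywords (the "first list") for the input text
    selected: category chosen by the classifier
    correct:  category the external source says was most relevant
    step:     fixed adjustment size (an assumption, not from the claims)
    """
    if selected == correct:
        return profiles  # classifier agreed with the external source
    for kw in keywords:
        # Reduce the evidence this keyword gives to the wrongly chosen category.
        if kw in profiles.get(selected, {}):
            profiles[selected][kw] -= step
        # Increase the evidence it gives to the category that was correct.
        correct_profile = profiles.setdefault(correct, {})
        correct_profile[kw] = correct_profile.get(kw, 0.0) + step
    return profiles


profiles = {
    "printers": {"toner": 2.0},
    "billing": {"toner": 1.0, "invoice": 3.0},
}
adjust_profile_weights(profiles, ["toner"], selected="billing", correct="printers")
```

In this example the weight of "toner" moves from the "billing" profile toward the "printers" profile, while "invoice" is untouched because it was not among the recognized keywords.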

14. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories wherein the 
domain specific knowledge base includes a knowl- 
edge base of keyword/category profiles, each cate- 
gory in the keyword/category profiles knowledge 
base having an associated profile which indicates 
what information provides evidence for a given 
category, the keyword/profile knowledge base is 
arranged to have associated with each keyword in 
a profile a profile weight that represents the 
amount of evidence a keyword provides for a given 
category; and 

a computer coupled to the memory, the computer 
including: 

means for accepting as input into the computer,
natural language input text,

means for parsing the natural language input text
into a first list of recognized keywords,

means for using the first list to deduce further facts 
from the natural language input text,

means for compiling the deduced facts into a second
list,

means for calculating a numeric similarity score for 
each one of the plurality of categories in the 
knowledge base to indicate how similar one of 
the plurality of categories is to the natural language
input text,

means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords of 
the first list,

means for adjusting the profile weights in the key- 
word/categories determined to be the most rele- 
vant to the natural language input text and a 
second ones of the plurality of categories deter- 
mined most relevant to the natural language 
input text by an external source, 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list, and 

means for passing the first list, the second list and 
the third list to an external application. 

15. The text classification system according to claim
14 wherein the keywords comprise words, phrases and
regular expressions.

16. The text classification system according to claim
14 wherein the domain specific knowledge base further
includes a rule base and the computer further comprises:

means for utilizing the rule base to select certain ones
of the plurality of categories that were determined 
to be most similar to the recognized keywords over 
other ones of the plurality of categories based on 
the first and second lists; and 

means for modifying the third list of the most similar
categories to include the certain ones of the plural- 
ity of categories selected. 

17. The text classification system according to claim
14 wherein the domain specific knowledge base includes
a knowledge base of keyword/category profiles,
each category in the keyword/category profiles knowledge
base having an associated profile which indicates
what information provides evidence for a given category,
the keyword/profile weight knowledge base is
arranged to have associated with each keyword in a
profile a profile weight that represents the amount of
evidence a keyword provides for a given category,
wherein the computer further comprises means for
adjusting the profile weights in the keyword/category
profiles in the domain specific knowledge base based
upon the ones of the plurality of categories determined
most relevant to the natural language input text and a
second ones of the plurality of categories determined
most relevant to the natural language input text by an
external source.

18. A method for classifying natural language text 
input into a computer system, the system includes mem- 
ory having a domain specific knowledge base having a 
plurality of categories stored therein and including a 
rule base, the method comprising the steps of:

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 



(c) using the first list to deduce further facts from the 
natural language input text; 

(d) compiling the deduced facts into a second list; 

(e) calculating a numeric similarity score for each one 
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which
ones of the plurality of categories are most similar 
to the recognized keywords of the first list; 

(g) compiling the ones of the plurality of categories 
determined to be most similar in step (f) into a third
list; 

(h) utilizing the rule base to select certain ones of the 
plurality of categories determined to be most simi- 
lar to the recognized keywords over other ones of 
the plurality of categories based on the first and 
second lists; and 

(i) modifying the third list of the most similar catego- 
ries to include the certain ones of the plurality of 
categories selected. 

19. The text classification system according to claim 
14 wherein the means for applying a dynamic threshold 
further includes: 

means for calculating a value for the dynamic thresh- 
old based upon a similarity score of a most similar 
category and a predefined threshold offset; and 

means for classifying the categories based upon their 
respective similarity scores by discarding catego- 
ries whose similarity scores are below the thresh- 
old value. 
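
As an illustration of the dynamic threshold recited in claim 19, the following minimal Python sketch assumes the threshold value is the highest similarity score minus the predefined threshold offset. The claim says only that the value is "based upon" those two quantities, so the subtraction, the function name `apply_dynamic_threshold`, and the sample scores are assumptions made for the example.

```python
def apply_dynamic_threshold(scores, threshold_offset):
    """Return the categories whose score is not below (best score - offset).

    scores: dict mapping category name -> numeric similarity score
    threshold_offset: predefined offset subtracted from the best score
                      (subtraction is an assumed reading of the claim)
    """
    if not scores:
        return []
    best = max(scores.values())          # score of the most similar category
    threshold = best - threshold_offset  # threshold moves with the best score
    kept = [(cat, s) for cat, s in scores.items() if s >= threshold]
    # Order the surviving categories from most to least similar.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [cat for cat, _ in kept]


scores = {"printers": 90, "networking": 75, "billing": 20}
third_list = apply_dynamic_threshold(scores, threshold_offset=25)
# third_list == ["printers", "networking"]
```

Because the threshold is computed from the best score for each input rather than fixed in advance, a weakly matching input can still yield a result, while a strongly matching input discards distant runners-up.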

20. A method for classifying natural language text 
input into a computer system, the system includes mem- 
ory having a domain specific knowledge base having a 
plurality of categories stored therein, the knowledge 
base including a lexicon that includes words, phrases 
and expressions and a keyword class hierarchy structured
such that keywords that share something in common
are grouped into classes, each class has associated
facts that are true when a member of the class is identified
in the natural language input text, the method
comprising the steps of:

(a) accepting as input natural language input text; 

(b) parsing the natural language input text into a first 
list of recognized keywords; 

(c) using the first list to deduce further facts from the 
natural language input text comprising the sub- 
steps of: 

(1) searching the keyword class hierarchy for all 
classes of which an identified keyword in the 
first list is a member, 

(2) locating all substitution keywords associated 
with each class of which the identified keyword 
is a member, 

(3) retrieving the located substitution keywords, 

(4) substituting the located substitution keywords 
for the identified keyword, 

(5) using the located substitution keywords to iden- 
tify matches between the located substitution 
keywords and phrases in the lexicon, 

(6) recursively applying sub-steps (2) through (5) 
on all classes above the classes of which the 
identified keyword is a member in the keyword 
class hierarchy, and 

(7) repeating sub-steps (1) through (6) for each 
keyword in the first list; 

(d) compiling the deduced facts into a second list;

(e) calculating a numeric similarity score for each one
of the plurality of categories in the knowledge base 
to indicate how similar one of the plurality of cate- 
gories is to the natural language input text; 

(f) applying a dynamic threshold to determine which
ones of the plurality of categories are most similar 
to the recognized keywords of the first list; and 

(g) compiling the ones of the plurality of categories 
determined to be most similar in step (f) into a third 
list.
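
The hierarchy walk in step (c) of claim 20 can be pictured with a simplified Python sketch. It covers only the class lookup of sub-step (1), the climb to higher classes of sub-step (6), and the per-keyword loop of sub-step (7), collecting the facts associated with each class reached. The substitution-keyword and lexicon phrase matching of sub-steps (2) through (5) is deliberately left out. The dictionaries `member_of`, `parent_of`, and `facts_of`, the function name, and the sample data are hypothetical structures chosen for the example, not the patent's data model.

```python
def deduce_facts(first_list, member_of, parent_of, facts_of):
    """Collect facts implied by each recognized keyword's classes and ancestors.

    first_list: recognized keywords
    member_of:  keyword -> list of classes the keyword belongs to
    parent_of:  class -> parent class (absent for a top-level class)
    facts_of:   class -> list of facts true when a member is identified
    """
    second_list = []
    for keyword in first_list:                    # repeat for each keyword
        pending = list(member_of.get(keyword, []))
        seen = set()
        while pending:
            cls = pending.pop()
            if cls in seen:
                continue                          # guard against revisiting a class
            seen.add(cls)
            for fact in facts_of.get(cls, []):
                if fact not in second_list:
                    second_list.append(fact)
            parent = parent_of.get(cls)
            if parent is not None:
                pending.append(parent)            # climb the class hierarchy
    return second_list


member_of = {"laserjet": ["laser_printer"]}
parent_of = {"laser_printer": "printer", "printer": "hardware"}
facts_of = {
    "laser_printer": ["uses_toner"],
    "printer": ["is_printer"],
    "hardware": ["is_hardware"],
}
second_list = deduce_facts(["laserjet", "jammed"], member_of, parent_of, facts_of)
# second_list == ["uses_toner", "is_printer", "is_hardware"]
```

A keyword with no class membership, such as "jammed" here, contributes nothing to the second list.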

21. A text classification system comprising: 
memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories, the domain 
specific knowledge base including a rule base; and

a computer coupled to the memory, the computer 
including: 

a natural language module for accepting as input 
into the computer natural language input text, 
the natural language module includes means for
parsing the natural language input text into a first 
list of recognized keywords; 

an intelligent inferencer module for using the first 
list to deduce further facts from the information 
explicitly stated in the natural language input
text, the intelligent inferencer module includes 
means for compiling the deduced facts into a 
second list; 

a similarity measuring module for calculating a 
numeric similarity score for each one of the plurality
of categories in the knowledge base to
indicate how similar one of the plurality of cate- 
gories is to the natural language input text, the 
similarity measuring module includes: 
means for applying a dynamic threshold to determine
which ones of the plurality of categories
are most similar to the recognized keywords 
of the natural language input text, and 
means for compiling the ones of the plurality of 
categories determined to be most similar into a
third list; and 
a category disambiguation module for utilizing the 
rule base to select certain ones of the plurality of 
categories determined to be most similar to the 
recognized keywords over other ones of the
plurality of categories based on the first and 
second lists, the category disambiguation module 
includes means for modifying the third list of the 
most similar categories to include the certain 
ones of the plurality of categories selected.

22. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a rule base and a plurality of categories; 
and

a computer coupled to the memory, the computer 
including: 

means for accepting as input into the computer, 
natural language input text, 

means for parsing the natural language input text
into a first list of recognized keywords, 

means for using the first list to deduce further facts 
from the natural language input text, 

means for compiling the deduced facts into a second
list,

means for calculating a numeric similarity score for
each one of the plurality of categories in the
knowledge base to indicate how similar one of
the plurality of categories is to the natural language
input text,

means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords of 
the first list, 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list, 

means for utilizing the rule base to select certain 
ones of the plurality of categories that were de- 
termined to be most similar to the recognized 
keywords over other ones of the plurality of 
categories based on the first and second lists, and 

means for modifying the third list of the most simi- 
lar categories to include the certain ones of the 
plurality of categories selected. 

23. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories; and 

a computer coupled to the memory, the computer 
including: 

means for accepting as input into the computer, 
natural language input text, 

means for parsing the natural language input text 
into a first list of recognized keywords, 

means for using the first list to deduce further facts 
from the natural language input text, 

means for compiling the deduced facts into a second
list,

means for calculating a numeric similarity score for
each one of the plurality of categories in the 
knowledge base to indicate how similar one of 
the plurality of categories is to the natural lan- 
guage input text, 

means for applying a dynamic threshold to deter- 
mine which ones of the plurality of categories 
are most similar to the recognized keywords of 
the first list, 

means for calculating a value for the dynamic 
threshold based upon a similarity score of a most 
similar category and a predefined threshold off- 
set, 

means for classifying the categories based upon 
their respective similarity scores by discarding 
categories whose similarity scores are below the 
threshold value, and 

means for compiling the ones of the plurality of 
categories determined to be most similar into a 
third list.

24. A text classification system comprising: 
a memory; 

a domain specific knowledge base stored in said mem- 
ory having a plurality of categories, the domain 
specific knowledge base including a knowledge 
base of keyword/category profiles, each category 
in the keyword/category profiles knowledge base 
having an associated profile which indicates what 
information provides evidence for a given cate- 
gory, the keyword/profile weight knowledge base 
is arranged to have associated with each keyword 
in a profile a profile weight that represents the 
amount of evidence a keyword provides for a given 
category; and 

a computer coupled to the memory, the computer
including:

means for accepting as input into the computer,
natural language input text,

means for parsing the natural language input text 
into a first list of recognized keywords, 

means for using the first list to deduce further facts
from the natural language input text, 

means for compiling the deduced facts into a sec- 
ond list, 

means for calculating a numeric similarity score for 
each one of the plurality of categories in the
knowledge base to indicate how similar one of 
the plurality of categories is to the natural lan- 
guage input text, 

means for applying a dynamic threshold to determine
which ones of the plurality of categories
are most similar to the recognized keywords of
the first list,

means for compiling the ones of the plurality of
categories determined to be most similar into a 
third list, and 

means for adjusting the profile weights in the key- 
word/category profiles in the domain specific 
knowledge base based upon the ones of the plu- 
rality of categories determined most relevant to 
the natural language input text and a second ones 
of the plurality of categories determined most 
relevant to the natural language input text by an
external source.